Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(chartdata): disable sqlparse calls for chart data requests to improve querying performance #19572

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

dvchristianbors
Copy link

@dvchristianbors dvchristianbors commented Apr 6, 2022

fix(chartdata): improve querying performance for long where ... in (...) statements.

SUMMARY

As described in andialbrecht/sqlparse#621, calling sqlparse.format and sqlparse.parse functions with grouping queries will cause quadratic runtime, ending in very long computation times for queries with large statements, e.g., long IN statements with a large number of keys.

In Chart Data, the sqlparse functions are needlessly called, this has several reasons:

  1. The ChartData component will automatically generate most of the structure of the final query. It is possible, however, that users can add custom SQLs for filters, columns, etc.) - However, validation of the query will also happen in connectors/sqla/models/get_query_str_extended, so if erroneos sql code is produced, we will still receive an error.
  2. The structure of the finished SQL query will be valid in every case, and the query will also never consist of multiple consecutive queries. Hence, it is not necessary to parse the query for a possible sequence of multiple queries. (which happens in models/core.py.
  3. The SQL automatically created by compile_sqla_query(sqlaq.sqla_query) will already result in a readable sql. Reformating the sql using sqlparser.format(sql, reindent=True) to reindent will only have a minor improvement in readability while sacrificing computation time.

Looking at the issue #19567 , querying times vary significantly depending on IN clause key size. Omitting the parse() and format() functions from the code used by ChartData will solve this issue, without having any impediments.

BEFORE/AFTER ANIMATED GIF

Before

superset_querying_performance_before

After

superset_querying_performance_after

TESTING INSTRUCTIONS

Please see issue #19567 for detailed steps to reproduce.
I also added a test case that showcases the computational penalty in this commit

ADDITIONAL INFORMATION

  • Has associated issue: Fixes Significant Increase in Querying Time Due to Sqlparse #19567
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@dvchristianbors dvchristianbors changed the title Disable sqlparse calls for Chart Data requests to Improve Querying Performance fix(chartdata): disable sqlparse calls for chart data requests to improve querying performance Apr 6, 2022
@villebro
Copy link
Member

villebro commented Apr 7, 2022

Thanks for the contribution @dvchristianbors . While I agree this performance hit is very problematic, we currently rely heavily on sqlparse for some very critical validation logic. Therefore, some of the changes in this PR will not be possible in their current form.

I looked into the original issue that you linked here, and noticed there seems to be open PRs on sqlparse that may address some of the observed performance issues. From our perspective, the optimal solution would be to fix the performance issues in the underlying library first, and only then optimize in Superset where needed. Out of curiosity, have you tested running Superset with a fork of sqlparse that has applied some of those open PRs? It would be interesting to see if they have a significant impact on performance..

@dvchristianbors
Copy link
Author

Thanks for the contribution @dvchristianbors . While I agree this performance hit is very problematic, we currently rely heavily on sqlparse for some very critical validation logic. Therefore, some of the changes in this PR will not be possible in their current form.

I looked into the original issue that you linked here, and noticed there seems to be open PRs on sqlparse that may address some of the observed performance issues. From our perspective, the optimal solution would be to fix the performance issues in the underlying library first, and only then optimize in Superset where needed. Out of curiosity, have you tested running Superset with a fork of sqlparse that has applied some of those open PRs? It would be interesting to see if they have a significant impact on performance..

Thanks for your comments. I agree that this would be the best option. I forked sqlparse and merged the PRs which supposedly addressed the issues, however I cannot confirm that they did improve performance (at least for the tested use cases).

I will further investigate the issue in sqlparse itself, and propose new changes as soon as I found a better solution.
@villebro Should I post my updates and revised solution here?

@dvmarkusvogl
Copy link

we currently rely heavily on sqlparse for some very critical validation logic
SQL-Parses self-description:
(pypi): sqlparse is a non-validating SQL parser for Python
(github) A non-validating SQL parser module for Python

They seem very keen on not being used for validation.
IMO the pythonic way would be to properly handle the database-response, and the fix above at least reduces the existing complexity.

Also, the project had 3 days of activity since October 2020, so I wouldn't expect much from them or rely to heavily from a project that hat 200+ open issues.

@rumbin
Copy link
Contributor

rumbin commented Nov 20, 2023

We also ran into this performance issue lately. Having some rather complex queries on a dfashboard drastically increases the dashboard load time, as the inefficiency of sqlparse keeps the gunicorn workers busy. This, in-turn, causes queries of the dashboard to be queued until a worker is free again.
So, all in all, the effect on dashboard load times is immense, unless we throw lots of gunicorn workers and many CPU cores at this problem.

@villebro, given the fact that there is virtually no activity in the sqlparse project and considering @dvmarkusvogl's comment above, I feel that following the approach suggested in this PR is the best option we have.
What do you think?

PS: We'll try patching sqlparse by virtue of andialbrecht/sqlparse#710 and report back...

@rumbin
Copy link
Contributor

rumbin commented Nov 21, 2023

Update on our investigations:

Thank you @dvchristianbors for providing this patch.
Now I only hope that this PR will be accepted in some way...

@dvchristianbors
Copy link
Author

Update on our investigations:

Thank you @dvchristianbors for providing this patch.
Now I only hope that this PR will be accepted in some way...

Thanks for letting us know. Did you test these changes with a recent branch? Our changes were made on a rather old version.

@WojtekWaga
Copy link

Hey @dvchristianbors, I applied this patch manually as there were some incompatible changes in the codebase and it helped. Time from clicking chart refresh to getting the query sent to db dropped significantly. Also show query is much faster now but (as expected) it shows uglier query now.

@rusackas
Copy link
Member

I was tempted to close the related issue as stale, but would love to know if anyone on this thread has interest in rebasing/rekindling this effort?

@dvchristianbors
Copy link
Author

dvchristianbors commented Feb 15, 2024

I was tempted to close the related issue as stale, but would love to know if anyone on this thread has interest in rebasing/rekindling this effort?

Yes, I would still be interested to finish this feature. If a rebase next week is still fine.

@rusackas
Copy link
Member

Perfectly fine. Thanks again!

@dvchristianbors
Copy link
Author

Perfectly fine. Thanks again!

Hello @rusackas, the PR was updated and is now ready for review.

@rusackas
Copy link
Member

Looks like the pre-commit hooks need to be run to solve at least some (if not all) of the CI issues.

@rusackas
Copy link
Member

Approving CI run 🤞

@john-bodley
Copy link
Member

@dvchristianbors, @betodealmeida recently introduced [SIP-117] Improve SQL parsing which proposes that we use sqlglot for all of our SQL parsing. Rather than disabling the sqlparse calls (which likely serve some purpose) to improve query performance, would you mind trying sqlglot instead?

@rafehi rafehi mentioned this pull request Feb 28, 2024
16 tasks
@pull-request-size pull-request-size bot added size/M and removed size/L labels Mar 5, 2024
@dvchristianbors
Copy link
Author

@dvchristianbors, @betodealmeida recently introduced [SIP-117] Improve SQL parsing which proposes that we use sqlglot for all of our SQL parsing. Rather than disabling the sqlparse calls (which likely serve some purpose) to improve query performance, would you mind trying sqlglot instead?

I replaced the usage of sqlparse with sqlglot for splitting up multiple sql statements. However, sqlglot does not include the formatting function that was specifically removed due to performance issues here.

@betodealmeida
Copy link
Member

Note that it is possible to pretty-format a SQL query with sqlglot, but you need to know the dialect:

https://github.com/apache/superset/pull/26767/files#diff-30f4c6ffdcb1f78a9e1ebbb60e1f297b379c181534d5a185a4cd37b1b16ac6f8R409

@dvchristianbors do you mind leaving a TODO comment to tentatively re-add the SQL formatting once SIP-117 is merged? (I saw the PR is approved but has conflicts, I'll work on them.

@rusackas
Copy link
Member

rusackas commented Jul 8, 2024

@dvchristianbors / @betodealmeida just checking in here... some folks on Slack were wondering if this is likely to merge. Looks like it needs a rebase at least, but also not sure if more needs to be done here regarding SIP-117.

@rams3sh
Copy link

rams3sh commented Sep 30, 2024

Our dashboards and charts use complex queries, resulting in high loading times. After investigating, we found that most of the time was consumed by sqlparse.parse and sqlparse.format. While searching for solutions, I came across this PR.

Since this problem is not solved in the upstream as of now, I have put down a hacky solution that worked for us so that it can benefit others who may be in same boat as me .

We're using the 4.0.2 Superset container image, but the patch in this PR wasn't directly compatible with our version. To address this, I used unittest.mock.patch to centrally patch all calls to sqlparse.parse and sqlparse.format. I didn't want to touch in multiple places and invite new issues / bugs in superset.

Following the PR's suggestion, I had sqlparse.format return the same SQL string which was passed to it without allowing to format anything. For sqlparse.parse, since it was called multiple times for the same query from various places, I implemented basic lru caching to avoid redundant calls.

I've copied the patched app.py code block below. Sharing this in case anyone else encounters a similar issue and needs a quick, stable solution.

The base app.py file for below patch is taken from 4.0.2 tag. The below code block is to be replaced with original superset/app.py.

Patched app.py

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

import logging
import os
from typing import Optional
from functools import wraps, lru_cache
from unittest.mock import patch

from flask import Flask
import sqlparse

from superset.initialization import SupersetAppInitializer

logger = logging.getLogger(__name__)

def auto_patch(mock_mapping):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            patches = []
            try:
                # Apply patches before executing the function
                for function_path, side_effect in mock_mapping.items():
                    patcher = patch(function_path, side_effect=side_effect)
                    patches.append(patcher.start())  # Start each patch
                
                # Execute the original function with patches in place
                return func(*args, **kwargs)
            finally:
                # Ensure patches are cleaned up after function execution
                for patcher in patches:
                    patcher.stop()
        
        return wrapper
    return decorator

# Getting original references to the to-be intercepted function before hand
# so as to avoid call recursion once the patch is done
orig_sqlparse_parse_func = sqlparse.parse
orig_sqlparse_format_func = sqlparse.format

@lru_cache(maxsize=100)
def cached_sqlparse_parse(*args, **kwargs):
    return orig_sqlparse_parse_func(*args, **kwargs)

# It will be useful to cache in case, original sqlparse.format function is 
# used
# @lru_cache(maxsize=100)
def mock_sqlparse_format(sql, *args, **kwargs):
    # One can retain the sqlparse_format in case it is required for their use case. 
    # value =  orig_sqlparse_format_func(sql, *args, **kwargs)
    value = sql 
    return value

def mock_sqlparse_parse(*args, **kwargs):
    value = cached_sqlparse_parse(*args, **kwargs)
    return value

patch_map = {
    'sqlparse.format': mock_sqlparse_format,
    'sqlparse.parse': mock_sqlparse_parse
}


# Patch all respective original entities with it's counter part mock entities
@auto_patch(patch_map)
def create_app(superset_config_module: Optional[str] = None) -> Flask:
    app = SupersetApp(__name__)

    try:
        # Allow user to override our config completely
        config_module = superset_config_module or os.environ.get(
            "SUPERSET_CONFIG", "superset.config"
        )
        app.config.from_object(config_module)

        app_initializer = app.config.get("APP_INITIALIZER", SupersetAppInitializer)(app)
        app_initializer.init_app()

        return app

    # Make sure that bootstrap errors ALWAYS get logged
    except Exception as ex:
        logger.exception("Failed to create app")
        raise ex


class SupersetApp(Flask):
    pass

@Airliquide76
Copy link

@rams3sh your app.py does the work for us on a really long and complex trino query 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Significant Increase in Querying Time Due to Sqlparse
10 participants